


Creators/Authors contains: "Jitsev, Jenia"


  1. The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with dataset curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline, the authors find that model-based filtering is critical for assembling a high-quality training set. Their resulting dataset, DCLM-Baseline, enables training a 7B-parameter model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. This is a 6.6 percentage point improvement over MAP-Neo, the previous state of the art in open-data language models, while using 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%, respectively) and performs similarly on average across 53 natural language understanding tasks, while using 6.6x less compute than Llama 3 8B. These findings underscore the importance of dataset design for training language models and establish a foundation for further research on data curation. 
    Free, publicly-accessible full text available April 21, 2026
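The model-based filtering the abstract highlights can be sketched in miniature: train a small text classifier on examples of high- and low-quality documents, score a corpus, and keep the top-scoring fraction. The sketch below uses a hand-rolled bag-of-words Naive Bayes as an illustrative stand-in (the names `NaiveBayesQualityFilter` and `filter_corpus` are hypothetical); the paper's actual filter is a learned classifier applied at web scale, not this toy.

```python
import math
from collections import Counter

class NaiveBayesQualityFilter:
    """Tiny bag-of-words Naive Bayes classifier: a toy stand-in for the
    model-based quality filter used to curate pretraining data."""

    def fit(self, positive_docs, negative_docs):
        # Word counts for the high-quality (positive) and low-quality
        # (negative) reference sets.
        self.pos_counts = Counter(w for d in positive_docs for w in d.lower().split())
        self.neg_counts = Counter(w for d in negative_docs for w in d.lower().split())
        self.vocab = set(self.pos_counts) | set(self.neg_counts)
        self.pos_total = sum(self.pos_counts.values())
        self.neg_total = sum(self.neg_counts.values())
        self.log_prior = math.log(len(positive_docs) / len(negative_docs))
        return self

    def score(self, doc):
        """Log-odds that `doc` is high quality (higher = better),
        with add-one (Laplace) smoothing for unseen words."""
        v = len(self.vocab)
        s = self.log_prior
        for w in doc.lower().split():
            p = (self.pos_counts[w] + 1) / (self.pos_total + v)
            q = (self.neg_counts[w] + 1) / (self.neg_total + v)
            s += math.log(p / q)
        return s

def filter_corpus(docs, model, keep_fraction=0.5):
    """Keep the top `keep_fraction` of documents by classifier score."""
    ranked = sorted(docs, key=model.score, reverse=True)
    return ranked[: max(1, int(len(docs) * keep_fraction))]
```

The key design point the benchmark probes is exactly the `keep_fraction`-style trade-off: aggressive filtering raises average data quality but shrinks the token budget.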
  2. The rapidly growing use of imaging infrastructure in the energy materials domain drives significant data accumulation in both volume and complexity. Routine image-processing techniques in materials research are often applied ad hoc, indiscriminately, and empirically, which obscures the crucial task of obtaining reliable quantification metrics. Moreover, these techniques are expensive, slow, and often involve several preprocessing steps. This paper presents a novel deep learning-based approach for high-throughput analysis of particle size distributions from transmission electron microscopy (TEM) images of carbon-supported catalysts for polymer electrolyte fuel cells. A dataset of 40 high-resolution TEM images at magnification levels from 10 to 100 nm scales was annotated manually and used to train a U-Net model with the StarDist loss formulation for the nanoparticle segmentation task. StarDist reached a precision of 86%, a recall of 85%, and an F1-score of 85% when trained on datasets as small as thirty images. The resulting segmentation maps outperform models reported in the literature for a similar problem, and the particle size analyses agree well with manual particle size measurements, at a significantly lower cost. 
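The step from segmentation maps to a particle size distribution can be sketched as follows. The helper `particle_size_distribution` is hypothetical (not from the paper); it assumes a StarDist-style integer instance map, where 0 is background and each positive label is one nanoparticle, and reports the equivalent circular diameter of each particle from its pixel area.

```python
import numpy as np

def particle_size_distribution(label_map, nm_per_pixel):
    """Equivalent circular diameters (nm) for each labeled particle.

    `label_map` is an integer array in which 0 marks background and each
    positive integer marks one segmented nanoparticle instance.
    Returns a dict mapping label -> diameter in nm.
    """
    labels, areas = np.unique(label_map[label_map > 0], return_counts=True)
    # Pixel count -> physical area, then the diameter of a circle of equal area.
    areas_nm2 = areas * nm_per_pixel ** 2
    diameters = 2.0 * np.sqrt(areas_nm2 / np.pi)
    return dict(zip(labels.tolist(), diameters.tolist()))
```

For example, at 1 nm per pixel a 4-pixel particle maps to a diameter of 2·sqrt(4/π) ≈ 2.26 nm; a histogram of the returned diameters is the size distribution compared against manual measurements.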